Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora

Authors

  • Daniel Ramage
  • David Leo Wright Hall
  • Ramesh Nallapati
  • Christopher D. Manning
Abstract

A significant portion of the world’s text is tagged by readers on social bookmarking websites. Credit attribution is an inherent problem in these corpora because most pages have multiple tags, but the tags do not always apply with equal specificity across the whole document. Solving the credit attribution problem requires associating each word in a document with the most appropriate tags and vice versa. This paper introduces Labeled LDA, a topic model that constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA’s latent topics and user tags. This allows Labeled LDA to directly learn word-tag correspondences. We demonstrate Labeled LDA’s improved expressiveness over traditional LDA with visualizations of a corpus of tagged web pages from del.icio.us. Labeled LDA outperforms SVMs by more than 3 to 1 when extracting tag-specific document snippets. As a multi-label text classifier, our model is competitive with a discriminative baseline on a variety of datasets.
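The constraint described in the abstract, a one-to-one correspondence between topics and tags, can be illustrated with a toy sampler. The sketch below is not the authors' implementation: the abstract does not state the inference procedure, so collapsed Gibbs sampling is assumed here, and the function name `labeled_lda_gibbs`, the hyperparameters `alpha` and `beta`, and the example data are all hypothetical. The only point it makes is that each token's topic is drawn solely from the tags attached to its own document.

```python
import random
from collections import defaultdict


def labeled_lda_gibbs(docs, doc_labels, num_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler (assumed inference method, not from the source).

    docs: list of token lists; doc_labels: list of non-empty label lists.
    Each token may only be assigned to a label of its own document.
    Returns (assignments, label_word_counts).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    label_word = defaultdict(lambda: defaultdict(int))  # label -> word -> count
    label_total = defaultdict(int)                      # label -> token count
    doc_label = [defaultdict(int) for _ in docs]        # per-document label counts
    assignments = []

    # Random initialisation, drawn only from each document's own labels.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.choice(doc_labels[d])
            zs.append(z)
            label_word[z][w] += 1
            label_total[z] += 1
            doc_label[d][z] += 1
        assignments.append(zs)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                label_word[z][w] -= 1
                label_total[z] -= 1
                doc_label[d][z] -= 1
                # Score only the labels this document actually carries.
                weights = [
                    (doc_label[d][k] + alpha)
                    * (label_word[k][w] + beta)
                    / (label_total[k] + vocab_size * beta)
                    for k in doc_labels[d]
                ]
                z = rng.choices(doc_labels[d], weights=weights)[0]
                assignments[d][i] = z
                label_word[z][w] += 1
                label_total[z] += 1
                doc_label[d][z] += 1
    return assignments, label_word


# Hypothetical example data: three tiny tagged "pages".
example_docs = [["python", "code", "recipe"], ["recipe", "food", "cook"], ["python", "food"]]
example_labels = [["programming"], ["cooking"], ["programming", "cooking"]]
example_assignments, example_counts = labeled_lda_gibbs(example_docs, example_labels, num_iters=50)
for tokens, tags in zip(example_docs, example_assignments):
    print(list(zip(tokens, tags)))
```

Because the candidate set for each token is the document's own tag list, a page with a single tag attributes every word to that tag, while a multi-tagged page splits its words among its tags according to the learned word-tag counts.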

Similar resources

Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data

We describe a nonparametric topic model for labeled data. The model uses a mixture of random measures (MRM) as the base distribution of the Dirichlet process (DP) in the HDP framework, so we call it the DP-MRM. To model labeled data, we define a DP-distributed random measure for each label, and the resulting model generates an unbounded number of topics for each label. We apply DP-MRM on single-la...

Feature extraction of hyperspectral images using boundary semi-labeled samples and hybrid criterion

Feature extraction is a very important preprocessing step for classification of hyperspectral images. The linear discriminant analysis (LDA) method fails to work in small sample size situations. Moreover, LDA has poor efficiency for non-Gaussian data. LDA is optimized by a global criterion. Thus, it is not sufficiently flexible to cope with the multi-modal distributed data. We propose a new fea...

Regularized Semi-supervised Latent Dirichlet Allocation for Visual Concept Learning

Topic models are a popular tool for visual concept learning. Most topic models are either unsupervised or fully supervised. In this paper, to take advantage of both limited labeled training images and rich unlabeled images, we propose a novel regularized Semi-Supervised Latent Dirichlet Allocation (r-SSLDA) for learning visual concept classifiers. Instead of introducing a new complex topic model,...

An Online Inference Algorithm for Labeled Latent Dirichlet Allocation

Using topic models to analyze documents is a popular method in text mining. Labeled Latent Dirichlet Allocation (Labeled LDA) is one such model, widely used to model tagged documents and to solve related problems such as tagged document visualization and snippet extraction. However, traditional batch inference for Labeled LDA, which runs over the entire document collection, is computati...

Semi-supervised Bibliographic Element Segmentation with Latent Permutations

This paper proposes a semi-supervised method for bibliographic element segmentation. Our input data is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g. authors, title, journal, pages, etc. We solve this problem with an LDA-like topic model by assigning each word token to a topic so...

Publication date: 2009